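08 re: Regular Expressions

Suppose you want to extract every email address from a pile of text, check whether a phone number a user typed is well-formed, or replace every "Agent" in a document with "智能体". Plain string methods handle these tasks poorly; regular expressions handle them with ease.

A regular expression is a pattern-matching language that uses a special syntax to describe the "shape" of a string. The re module is the Python standard-library module for working with regular expressions.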

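1. Basic Matching

1.1 re.search()

Scans the string and returns the first match it finds, or None if there is none.

python

import re

text = "My email is test@example.com, phone is 13800138000"

# Search for an email address
match = re.search(r'\w+@\w+\.\w+', text)
if match:
    print(match.group())  # test@example.com

r'' is a raw string, which keeps backslashes exactly as written. Writing regex patterns as raw strings is recommended.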
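To see why the raw-string prefix matters, here is a minimal sketch comparing the same pattern written with and without it. The variable names and the sample sentence are made up for illustration.

```python
import re

# Without the r prefix, Python itself turns "\b" into a backspace
# character before the regex engine ever sees the pattern.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain))  # 5: backspace, c, a, t, backspace
print(len(raw))    # 7: both backslashes survive as literal characters

text = "a cat sat"
plain_match = re.search(plain, text)  # None: it looks for literal backspaces
raw_match = re.search(raw, text)      # matches "cat" between word boundaries
print(plain_match, raw_match.group())  # None cat
```

The raw version reaches the regex engine as \b (word boundary), which is almost always what was intended.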

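1.2 re.match()

Matches only at the beginning of the string.

python

import re

# match() only matches at the beginning
re.match(r'\d+', "123abc")   # match succeeds
re.match(r'\d+', "abc123")   # returns None (the string does not start with a digit)

# search() scans the whole string
re.search(r'\d+', "abc123")  # match succeeds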

1.3 re.fullmatch()

The pattern must match the entire string.

python

import re

re.fullmatch(r'\d+', "123")     # match succeeds
re.fullmatch(r'\d+', "123abc")  # returns None (trailing characters are left over)
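The three functions differ only in where the match is allowed to sit. As a rough side-by-side sketch, the loop below runs one pattern against three sample inputs (chosen here for illustration) and records which calls succeed.

```python
import re

pattern = r'\d+'
rows = []
for s in ["123", "123abc", "abc123"]:
    rows.append((
        s,
        bool(re.match(pattern, s)),      # anchored at the start only
        bool(re.search(pattern, s)),     # anywhere in the string
        bool(re.fullmatch(pattern, s)),  # must cover the whole string
    ))

for s, m, se, f in rows:
    print(f"{s!r:10} match={m!s:5} search={se!s:5} fullmatch={f}")
```

Only "123" passes all three; "123abc" fails fullmatch(), and "abc123" is found only by search().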

2. Finding All Matches

2.1 re.findall()

Returns a list of all matches.

python

import re

text = "Prices: 99 yuan, 188 yuan, 299 yuan"

# Find all numbers
prices = re.findall(r'\d+', text)
print(prices)  # ['99', '188', '299']

2.2 re.finditer()

Returns an iterator of Match objects, which suits large numbers of matches because they are produced one at a time.

python

import re

text = "user1@test.com and user2@example.org"

for match in re.finditer(r'\w+@\w+\.\w+', text):
    print(f"Found: {match.group()}, span: {match.span()}")
# Found: user1@test.com, span: (0, 14)
# Found: user2@example.org, span: (19, 36)

3. Replacing

3.1 re.sub()

Replaces the text that matches the pattern.

python

import re

text = "Agent frameworks: LangChain, LangGraph, AutoGen"

# Replace every comma separator with a semicolon
result = re.sub(r', ', '; ', text)
print(result)  # Agent frameworks: LangChain; LangGraph; AutoGen

# Replace each digit with *
result = re.sub(r'\d', '*', "Phone: 13800138000")
print(result)  # Phone: ***********

# Use a function as the replacement
def double_number(match):
    return str(int(match.group()) * 2)

result = re.sub(r'\d+', double_number, "Price: 10 yuan and 20 yuan")
print(result)  # Price: 20 yuan and 40 yuan

3.2 re.subn()

Works like sub(), but returns a tuple that also contains the number of replacements made.

python

import re

result, count = re.subn(r',', '; ', "A,B,C")
print(result)  # A; B; C
print(count)   # 2
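Both sub() and subn() also accept a count argument that caps the number of replacements. A small sketch, using a made-up sample string:

```python
import re

text = "a-b-c-d"

# count=1 replaces only the first match; the default count=0 replaces all.
first_only = re.sub(r'-', '+', text, count=1)
all_matches = re.sub(r'-', '+', text)

print(first_only)   # a+b-c-d
print(all_matches)  # a+b+c+d
```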

4. Splitting

4.1 re.split()

Splits a string wherever the pattern matches.

python

import re

# Split on any of several delimiters
text = "apple;orange,banana|grape"
result = re.split(r'[;,|]', text)
print(result)  # ['apple', 'orange', 'banana', 'grape']

# Split on whitespace (a run of several spaces counts as one separator)
text = "hello   world   python"
result = re.split(r'\s+', text)
print(result)  # ['hello', 'world', 'python']

# Limit the number of splits
result = re.split(r'\s+', text, maxsplit=1)
print(result)  # ['hello', 'world   python']
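One behavior of re.split() that is easy to overlook: if the pattern contains a capturing group, the separators are kept in the result. A short sketch, with sample text chosen for illustration:

```python
import re

text = "apple;orange,banana"

# Without a group, the separators are discarded.
without_group = re.split(r'[;,]', text)
# With a capturing group, each separator is kept as its own list item.
with_group = re.split(r'([;,])', text)

print(without_group)  # ['apple', 'orange', 'banana']
print(with_group)     # ['apple', ';', 'orange', ',', 'banana']
```

This can help when the text needs to be reassembled later with its original delimiters.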

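5. Regular Expression Syntax

5.1 Common Metacharacters

Character	Meaning	Example
`.`	Any character except a newline	`a.c` matches "abc", "a1c"
`^`	Start of the string	`^Hello` matches text that starts with Hello
`$`	End of the string	`end$` matches text that ends with end
`*`	Zero or more times	`ab*c` matches "ac", "abc", "abbc"
`+`	One or more times	`ab+c` matches "abc", "abbc", but not "ac"
`?`	Zero or one time	`ab?c` matches "ac", "abc"
`{m}`	Exactly m times	`a{3}` matches "aaa"
`{m,n}`	Between m and n times	`a{2,4}` matches "aa", "aaa", "aaaa"
`|`	Alternation (or)	`cat|dog` matches "cat" or "dog"
`()`	Grouping	`(ab)+` matches "ab", "abab"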
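The examples in the table can be checked mechanically. The sketch below runs each pattern through re.fullmatch() against strings that should and should not match. The list layout and the negative examples are additions made here for illustration.

```python
import re

# Each entry: (pattern, strings that should match fully, strings that should not)
cases = [
    (r'a.c',     ["abc", "a1c"],         ["ac"]),
    (r'ab*c',    ["ac", "abc", "abbc"],  ["adc"]),
    (r'ab+c',    ["abc", "abbc"],        ["ac"]),
    (r'ab?c',    ["ac", "abc"],          ["abbc"]),
    (r'a{3}',    ["aaa"],                ["aa"]),
    (r'a{2,4}',  ["aa", "aaa", "aaaa"],  ["a", "aaaaa"]),
    (r'cat|dog', ["cat", "dog"],         ["cow"]),
    (r'(ab)+',   ["ab", "abab"],         ["aba"]),
]

results = []
for pattern, should_match, should_fail in cases:
    ok = (all(re.fullmatch(pattern, s) for s in should_match)
          and not any(re.fullmatch(pattern, s) for s in should_fail))
    results.append(ok)
    print(f"{pattern!r:12} {'ok' if ok else 'MISMATCH'}")
```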

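5.2 Character Sets

python

import re

# [abc]: matches a, b, or c
re.findall(r'[aeiou]', "hello")  # ['e', 'o']

# [a-z]: matches any character from a to z
re.findall(r'[a-z]+', "Hello World 123")  # ['ello', 'orld']

# [^abc]: matches any character except a, b, and c
re.findall(r'[^0-9]+', "abc123def")  # ['abc', 'def']

# Common shorthand character classes
# (for str patterns these are Unicode-aware by default; the ASCII
#  equivalents below apply when the re.ASCII flag is set)
# \d: digit, [0-9] under re.ASCII
# \D: non-digit, [^0-9] under re.ASCII
# \w: word character, [a-zA-Z0-9_] under re.ASCII
# \W: non-word character
# \s: whitespace (space, tab, newline, and so on)
# \S: non-whitespace character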
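For str patterns in Python 3, the shorthand classes are Unicode-aware unless the re.ASCII flag is given, so \w also matches Chinese characters and \d also matches non-ASCII digits. A minimal sketch of the difference, with sample strings invented for illustration:

```python
import re

text = "price_1 价格 42"

unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, re.ASCII)
print(unicode_words)  # ['price_1', '价格', '42']
print(ascii_words)    # ['price_1', '42']

# \d is Unicode-aware too: the fullwidth digits U+FF14 U+FF12 count as digits.
digits = "42 \uff14\uff12"
unicode_digits = re.findall(r'\d+', digits)
ascii_digits = re.findall(r'\d+', digits, re.ASCII)
print(len(unicode_digits), len(ascii_digits))  # 2 1
```

When a pattern should accept ASCII only (for example, when validating identifiers), passing re.ASCII or spelling out [0-9] and [a-zA-Z0-9_] makes that explicit.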

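5.3 Greedy vs. Non-Greedy

Quantifiers are greedy by default (they match as much as possible). Appending ? makes them non-greedy:

python

import re

text = "<h1>Title</h1><p>Paragraph</p>"

# Greedy (match as much as possible)
re.findall(r'<.*>', text)
# ['<h1>Title</h1><p>Paragraph</p>']

# Non-greedy (match as little as possible)
re.findall(r'<.*?>', text)
# ['<h1>', '</h1>', '<p>', '</p>']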
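A common alternative to the non-greedy quantifier is a negated character set that simply cannot run past the closing delimiter. The sketch below, on an illustrative sample string, shows that both give the same result for simple tags:

```python
import re

text = "<h1>Title</h1><p>Paragraph</p>"

lazy = re.findall(r'<.*?>', text)
negated = re.findall(r'<[^>]+>', text)

print(lazy == negated)  # True
print(negated)          # ['<h1>', '</h1>', '<p>', '</p>']
```

The negated set states the intent directly ("anything except >"), and unlike . it also matches across newlines.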

6. Groups and Capturing

6.1 Basic Groups

python

import re

text = "2026-06-13"

# Group with ()
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)
if match:
    print(match.group(0))  # 2026-06-13 (the whole match)
    print(match.group(1))  # 2026 (group 1)
    print(match.group(2))  # 06 (group 2)
    print(match.group(3))  # 13 (group 3)
    print(match.groups())  # ('2026', '06', '13') (all groups)

6.2 Named Groups

python

import re

text = "2026-06-13"

# Named group: (?P<name>...)
match = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', text)
if match:
    print(match.group('year'))   # 2026
    print(match.group('month'))  # 06
    print(match.group('day'))    # 13
    print(match.groupdict())     # {'year': '2026', 'month': '06', 'day': '13'}

6.3 Groups in findall()

python

import re

text = "Prices: 99 yuan, 188 yuan, 299 yuan"

# With no group, findall() returns the whole match
re.findall(r'\d+ yuan', text)
# ['99 yuan', '188 yuan', '299 yuan']

# With one group, it returns only the group's content
re.findall(r'(\d+) yuan', text)
# ['99', '188', '299']

# With several groups, it returns a list of tuples
text = "2026-06-13, 2025-12-25"
re.findall(r'(\d{4})-(\d{2})-(\d{2})', text)
# [('2026', '06', '13'), ('2025', '12', '25')]
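Because findall() switches to returning group contents as soon as the pattern contains a capturing group, parentheses that are only there for repetition can change the result unexpectedly. A non-capturing group (?:...) avoids that. A small sketch, with a sample string made up for illustration:

```python
import re

text = "abab cdcd ab"

# A capturing group makes findall() return only the group's last repetition.
captured = re.findall(r'(ab)+', text)
# A non-capturing group (?:...) still groups, but findall() returns whole matches.
whole = re.findall(r'(?:ab)+', text)

print(captured)  # ['ab', 'ab']
print(whole)     # ['abab', 'ab']
```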

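7. Compiling Regular Expressions

If the same regular expression is used many times, you can compile it once and reuse the resulting pattern object. The re module already caches recently used patterns, so the speedup is usually modest; the main gains are skipping the cache lookup in hot loops and giving the pattern a name:

python

import re

# Compile the regular expression
pattern = re.compile(r'\d+')

# Use the compiled object
pattern.findall("abc123def456")  # ['123', '456']
pattern.search("abc123")         # a Match object
pattern.sub('*', "abc123def")    # 'abc*def'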

8. The Match Object

A successful match returns a Match object that carries detailed information about the match:

python

import re

match = re.search(r'(\d+)-(\d+)', "Date: 2026-06")

# The matched text
match.group()      # '2026-06'
match.group(0)     # '2026-06'
match.group(1)     # '2026'
match.group(2)     # '06'

# All groups
match.groups()     # ('2026', '06')

# Match position
match.start()      # 6 (start index)
match.end()        # 13 (end index, exclusive)
match.span()       # (6, 13)

# The original string
match.string       # 'Date: 2026-06'
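When nothing matches, search(), match(), and fullmatch() return None instead of a Match object, so calling .group() directly would raise AttributeError. One way to guard against that is sketched below. It assumes Python 3.8+ for the := operator, and first_number is a hypothetical helper written for this example.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    if (match := re.search(r'\d+', text)) is not None:
        return match.group()
    return None

print(first_number("order 66 shipped"))  # 66
print(first_number("no digits here"))    # None
```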

9. Flags

9.1 Ignoring Case

python

import re

# re.I or re.IGNORECASE
re.findall(r'hello', "Hello HELLO hello", re.I)
# ['Hello', 'HELLO', 'hello']

9.2 Multiline Mode

python

import re

text = """first
second
third"""

# re.M or re.MULTILINE: ^ and $ match at the start and end of every line
re.findall(r'^\w+', text, re.M)
# ['first', 'second', 'third']

9.3 Making the Dot Match Newlines

python

import re

text = """<p>
content
</p>"""

# re.S or re.DOTALL: makes . match every character, including newlines
re.findall(r'<p>.*?</p>', text, re.S)
# ['<p>\ncontent\n</p>']

9.4 Combining Flags

python

import re

# Use several flags at once
pattern = re.compile(r'^\w+', re.I | re.M)
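The compiled pattern above is never applied to any input. As a minimal sketch of combined flags in action, the example below searches some made-up log lines for the word "error" at the start of a line, in any letter case.

```python
import re

text = "Error: disk full\nwarning: low memory\nERROR: timeout"

# re.I ignores case; re.M lets ^ match at the start of every line.
pattern = re.compile(r'^error', re.I | re.M)
hits = pattern.findall(text)
print(hits)  # ['Error', 'ERROR']

# Without re.M, ^ only matches at the very start of the string.
start_only = re.findall(r'^error', text, re.I)
print(start_only)  # ['Error']
```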

10. Common Regex Patterns

10.1 Validating an Email Address

python

import re

def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    # fullmatch() rejects a trailing newline that match() with $ would accept
    return bool(re.fullmatch(pattern, email))

is_valid_email("test@example.com")  # True
is_valid_email("invalid.email")     # False

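10.2 Validating a Mobile Phone Number

python

import re

def is_valid_phone(phone):
    # Mainland China mobile numbers: 11 digits, starting with 1, second digit 3-9
    pattern = r'^1[3-9]\d{9}$'
    # fullmatch() rejects a trailing newline that match() with $ would accept
    return bool(re.fullmatch(pattern, phone))

is_valid_phone("13800138000")  # True
is_valid_phone("12345678901")  # False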
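One edge case is worth checking when a validator relies on the $ anchor: $ also matches just before a trailing newline, so re.match() can accept input that still carries a line ending. The sketch below compares it with re.fullmatch(). The variable names and the sample input are illustrative.

```python
import re

phone_pattern = r'^1[3-9]\d{9}$'
candidate = "13800138000\n"  # e.g. a line read from a file, newline included

# $ also matches just before a trailing newline, so match() accepts this input.
accepted_by_match = bool(re.match(phone_pattern, candidate))
# fullmatch() requires the pattern to consume the entire string.
accepted_by_fullmatch = bool(re.fullmatch(r'1[3-9]\d{9}', candidate))

print(accepted_by_match)      # True
print(accepted_by_fullmatch)  # False
```

For strict validation, fullmatch() (or the \Z anchor in place of $) is usually the safer choice.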

10.3 Extracting URLs

python

import re

def extract_urls(text):
    pattern = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    return re.findall(pattern, text)

text = "Visit https://example.com or http://test.org/path"
extract_urls(text)
# ['https://example.com', 'http://test.org/path']

10.4 Extracting Chinese Text

python

import re

def extract_chinese(text):
    return re.findall(r'[\u4e00-\u9fff]+', text)

text = "Hello 你好 World 世界"
extract_chinese(text)
# ['你好', '世界']

10.5 Stripping HTML Tags

python

import re

def strip_html(html):
    return re.sub(r'<[^>]+>', '', html)

strip_html("<p>Hello <b>World</b></p>")
# 'Hello World'

11. Practical Tips

11.1 Using Regex Safely

python

import re

def safe_search(pattern, text, default=""):
    """Search safely: return default when nothing matches or the pattern is invalid."""
    try:
        match = re.search(pattern, text)
        return match.group() if match else default
    except re.error:
        return default
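A short usage sketch of this helper, covering its three outcomes: a match, no match, and an invalid pattern. The function is repeated so the snippet runs on its own, and the inputs are made up for illustration.

```python
import re

def safe_search(pattern, text, default=""):
    """Same helper as above, repeated so this snippet runs on its own."""
    try:
        match = re.search(pattern, text)
        return match.group() if match else default
    except re.error:
        return default

print(safe_search(r'\d+', "abc123"))           # 123
print(safe_search(r'\d+', "abc", "n/a"))       # n/a (no match)
print(safe_search(r'(\d+', "abc123", "n/a"))   # n/a (invalid pattern: unclosed group)
```

Swallowing re.error is mainly useful when the pattern comes from user input; for patterns written in the source code, letting the error surface is usually preferable.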

11.2 Referencing Groups in the Replacement

python

import re

# Reference groups with \1, \2
text = "2026/06/13"
result = re.sub(r'(\d{4})/(\d{2})/(\d{2})', r'\1-\2-\3', text)
print(result)  # 2026-06-13

# Reference named groups with \g<name>
result = re.sub(r'(?P<y>\d{4})/(?P<m>\d{2})/(?P<d>\d{2})', r'\g<y>-\g<m>-\g<d>', text)
print(result)  # 2026-06-13
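The \g<...> form also works with group numbers, which matters when a group reference is followed directly by a digit. A minimal sketch; the sample text and variable names are invented for illustration.

```python
import re

text = "item7"

# \g<1>0 means "group 1 followed by a literal 0".
padded = re.sub(r'(\w+?)(\d)', r'\g<1>0\g<2>', text)
print(padded)  # item07

# \10 is read as a reference to group 10, which this pattern does not have.
try:
    re.sub(r'(\w+?)(\d)', r'\10\2', text)
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True
print(ambiguous_failed)  # True
```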

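11.3 Precompiling for Performance

python

import re

# When a pattern is used inside a loop, precompiling it avoids repeated cache lookups
EMAIL_PATTERN = re.compile(r'\w+@\w+\.\w+')

# Sample data standing in for an open file or any other iterable of lines
large_file = ["contact: a@test.com", "no address here", "b@example.org, c@example.net"]

emails = []
for line in large_file:
    emails.extend(EMAIL_PATTERN.findall(line))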

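12. Summary

Core functions of the re module:

Function	Purpose
`re.search(pattern, string)`	Find the first match
`re.match(pattern, string)`	Match at the beginning of the string
`re.fullmatch(pattern, string)`	Match the entire string
`re.findall(pattern, string)`	Return a list of all matches
`re.finditer(pattern, string)`	Return an iterator of matches
`re.sub(pattern, repl, string)`	Replace matches
`re.split(pattern, string)`	Split on the pattern
`re.compile(pattern)`	Compile a regular expression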

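Common metacharacters:

Character	Meaning
`.`	Any character (except a newline)
`\d`	Digit
`\w`	Word character
`\s`	Whitespace character
`*`	Zero or more times
`+`	One or more times
`?`	Zero or one time
`{m,n}`	Between m and n times
`[]`	Character set
`()`	Grouping
`|`	Alternation (or)

You do not need to memorize regular expressions; looking up the documentation when you need it is enough. A handful of patterns cover most everyday use: \d+ for digits, \w+ for words, and .*? for non-greedy matching.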
